Abstract

Background: The graph-theoretical analysis of molecular networks has a long tradition in chemoinformatics. As
demonstrated frequently, the Molfile format is a well-designed format for encoding chemical structures and
structure-related information of organic compounds. But when it comes to using modern programming languages for
statistical data analysis in Bio- and Chemoinformatics, R, one of the most powerful free languages, lacks tools to
process Molfile data collections and import molecular network data into R.

Results: We design an R object which allows a lossless information mapping of structural information from Molfiles 
into R objects. This provides the basis to use the RMol object as an anchor for connecting Molfile data collections 
with R libraries for analyzing graphs. Associated with the RMol objects, a set of R functions completes the toolset to 
organize, describe and manipulate the converted data sets. Further, we bypass R-typical limits for manipulating large 
data sets by storing R objects in bz-compressed serialized files instead of employing RData files. 

Conclusions: By design, RMol is an R toolset without dependencies on other libraries or programming languages. It
can be integrated into pipelines for serialized batch analysis of network data and, therefore, helps to process
SD-file data sets in R efficiently. It is freely available under the BSD licence. The script source can be downloaded from
http://sourceforge.net/p/rmol-toolset.



Background 

To solve many tasks in Bio- and Chemoinformatics, the
analysis of chemical and biological structures represented
by networks has proven powerful [1,2]. A typical
problem in this area is to characterize the structure of
molecular networks quantitatively by using graph measures
[3-7] or to predict physicochemical properties of the
molecules by taking structural features into account [8].

For quantifying structural information of molecular networks,
one often needs quantitative or comparative network
measures to analyze the structure of the underlying
networks [1,2]. For instance, Dragon [9] is a well-known
commercial software package for calculating so-called molecular
descriptors from SD/Molfile data [10] and other data formats
specializing in chemical structures. But for
the programming environment R [11], there is as yet no
interface for employing structural information of molecular
networks encoded by SD/Molfile data.

To tackle this problem and, hence, to broaden
the use of R in chemically and biologically driven
disciplines, we develop an R toolset for transforming
SD/Molfile structure information into R objects. As structural
information of the networks is now available in R, we
hope that our tool may stimulate the Bioinformatics community
to explore problems centered around chemical
and molecular networks by using existing R packages.

Tools for graph analysis 

In this section, we briefly sketch some tools for analyzing 
graphs by using R. An extensive review of such tools can 
be found in [12,13]. 

Among other environments suitable for graph analysis
[14], the scripting language R has gained much importance.
This is not only because basic functions are provided
by R packages such as graph [15] from
Bioconductor [16] and igraph [17], but also because
of packages such as QuACN [18]. The latter has contributed
extensively to the quantitative analysis of networks [18] with
R. Note that QuACN is an R tool for calculating ca. 150
quantitative network measures, most of which can be
interpreted as complexity indices [19].

Also, Guha [20] developed a set of wrapper functions
that give R users access to the functions and objects of
the CDK [21], a Java framework for cheminformatics.
The chemoinformatics package ChemMineR
[22], written in R, includes in updated versions functions
which are capable of reading and extracting structural
information from different data formats, including
SD-files. But we emphasize that both packages
are conceptually focused on the inspection of single
networks or the comparison of limited data sets, in contrast
to the RMol script collection, which fits well into workflows
with serialized pipelines. To reduce dependencies,
we extract SD-file information with our own R parser.
Also, to avoid repetitive transformation processes, we
store the chemical and molecular network information in
R objects.

In this report, we present an R toolset for linking available
R functions with existing data collections representing
chemical structures by mapping structural information
of MDL-Molfiles to an R object called RMol. Figure 1
illustrates the lossless mapping from Molfile information
to a list consisting of four elements that represents an
RMol object.

Figure 1 Molfile <-> RMol mapping. Four structural elements (entry header EH, counts line CL, atom block AB, bond block BB) from a Molfile
are transformed to list elements of an RMol object, RMol <- list(EH,CL,AB,BB). Column labels of the RMol data frames CL, AB, BB are named according to the CT (Chemical
Table) file specifications [23]. The example shown is the hydrogen-depleted MDL Molfile entry of glycine (H2NCH2COOH), geometry optimized at PM3, with five atoms (C, O, O, C, N) and terminated by the end of entry marker $$$$.
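As a rough illustration of the list structure described above, the following sketch builds an RMol-like object by hand for the glycine entry of Figure 1. It is not output of the RMol toolset: the column names (x, y, z, symbol, from, to, type, n_atoms, n_bonds, version) are simplified placeholders rather than the CT-file labels used by RMol, and the bond table is an assumption based on the usual connectivity of glycine.

```r
# Hand-built, hypothetical RMol-like object (not produced by the RMol toolset).

# Entry header: the three header lines of the Molfile entry.
EH <- c("H2NCH2COOH Glycine",
        "##CCCBDB 4191205:03",
        "Geometry Optimized at PM3, hydrogen-depleted")

# Atom block: coordinates and element symbols of the five heavy atoms.
AB <- data.frame(
  x = c(0.0000, 1.1767, -0.8890, -0.6104, 0.3604),
  y = c(0.5332, 0.8510, 1.5518, -0.8524, -1.9556),
  z = rep(0, 5),
  symbol = c("C", "O", "O", "C", "N"),
  stringsAsFactors = FALSE
)

# Bond block (assumed connectivity): C1=O2, C1-O3, C1-C4, C4-N5.
BB <- data.frame(
  from = c(1L, 1L, 1L, 4L),
  to   = c(2L, 3L, 4L, 5L),
  type = c(2L, 1L, 1L, 1L)
)

# Counts line: derived here from the two blocks (placeholder columns).
CL <- data.frame(n_atoms = nrow(AB), n_bonds = nrow(BB),
                 version = "V2000", stringsAsFactors = FALSE)

# The four elements are combined into one list.
rmol <- list(EH = EH, CL = CL, AB = AB, BB = BB)
str(rmol, max.level = 1)
```

Keeping the three tabular parts as data frames means that standard R subsetting (for example rmol$AB$symbol) is enough to reach any field of the entry.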

Results and discussion 

Besides defining the novel RMol object for encoding
chemical structure information in R, we develop
a set of functions for the programming environment R
to facilitate efficient graph-based analysis
of the underlying molecular networks, e.g., chemical
structures. More precisely, this tool covers the following
functionalities:

• Importing chemical structure data from an SD-file
(Molfile format) into R.

• Handling of RMol data sets as serialized
bz-compressed files (to bypass memory limits).

• Providing simple statistics of chemical structures or 
structure data sets in RMol format. 

• A filter for selecting chemical structures and 
reorganizing data collections in RMol format. 

• Generating adjacency matrices or connection tables 
from chemical structures in RMol format. 

• Converting RMol objects into attribute-extended 
graphNEL objects. By doing so, this links directly to 
R packages for graph analysis (e.g., see graph, 
igraph, QuACN). 
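To make the adjacency-matrix item of the list above more concrete, here is a minimal sketch in base R. The helper name bonds_to_adjacency and the column names from and to are hypothetical and are not taken from the RMol script; the bond table is the assumed hydrogen-depleted glycine connectivity.

```r
# Hypothetical helper: build a symmetric 0/1 adjacency matrix from a bond table.
bonds_to_adjacency <- function(symbols, bonds) {
  n <- length(symbols)
  labels <- paste0(symbols, seq_len(n))            # e.g. "C1", "O2", ...
  adj <- matrix(0L, nrow = n, ncol = n, dimnames = list(labels, labels))
  idx <- cbind(bonds$from, bonds$to)               # one (row, column) pair per bond
  adj[idx] <- 1L                                   # entry from -> to
  adj[idx[, 2:1, drop = FALSE]] <- 1L              # entry to -> from (undirected graph)
  adj
}

symbols <- c("C", "O", "O", "C", "N")
bonds <- data.frame(from = c(1L, 1L, 1L, 4L),
                    to   = c(2L, 3L, 4L, 5L),
                    type = c(2L, 1L, 1L, 1L))

adj <- bonds_to_adjacency(symbols, bonds)
print(adj)
print(rowSums(adj))   # vertex degrees
```

The bond type column is ignored here. A labeled variant could store bonds$type instead of 1 in the matrix entries.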

In the following, we explain some items of RMol in
more detail. The function Sdf2RMol has been developed
to process SD-files and convert chemical structure
information from the Molfile portions into RMol objects.

Concretely, Sdf2RMol is a working
script which combines an entry-picking routine
(pickSdfEntry) with an RMol-specific parser
(parseSdfEntry) that uses regular expressions to scan the
Molfile sections of SD-files according to the CT-file format
specifications [23]. Moreover, Sdf2RMol completes
the conversion pipeline with error logging and internal
routines that check input entries for consistency and
plausibility. Finally, the resulting R objects are streamed as
data sets into serialized bz-compressed files.
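The two steps described above, picking entries and parsing them, might look roughly like the following base-R sketch. It is not the code of Sdf2RMol: the function parse_molfile_entry is a hypothetical stand-in, the fields are extracted by fixed column positions of the V2000 layout with substr() rather than by the regular expressions the toolset uses, and the bond block of the sample entry is assumed.

```r
# Hypothetical parser for one V2000 Molfile entry given as a character vector.
parse_molfile_entry <- function(lines) {
  counts  <- lines[4]                              # counts line follows 3 header lines
  n_atoms <- as.integer(substr(counts, 1, 3))
  n_bonds <- as.integer(substr(counts, 4, 6))
  # Simple consistency check: the entry must be long enough for both blocks.
  if (is.na(n_atoms) || is.na(n_bonds) ||
      length(lines) < 4 + n_atoms + n_bonds) {
    stop("counts line is inconsistent with the entry length")
  }
  atom_lines <- lines[4 + seq_len(n_atoms)]
  bond_lines <- lines[4 + n_atoms + seq_len(n_bonds)]
  AB <- data.frame(
    x = as.numeric(substr(atom_lines, 1, 10)),
    y = as.numeric(substr(atom_lines, 11, 20)),
    z = as.numeric(substr(atom_lines, 21, 30)),
    symbol = trimws(substr(atom_lines, 32, 34)),
    stringsAsFactors = FALSE
  )
  BB <- data.frame(
    from = as.integer(substr(bond_lines, 1, 3)),
    to   = as.integer(substr(bond_lines, 4, 6)),
    type = as.integer(substr(bond_lines, 7, 9))
  )
  CL <- data.frame(n_atoms = n_atoms, n_bonds = n_bonds)
  list(EH = lines[1:3], CL = CL, AB = AB, BB = BB)
}

# A one-entry SD-file built in memory; sprintf() guarantees the fixed widths.
sdf <- c(
  "H2NCH2COOH Glycine",
  "##CCCBDB 4191205:03",
  "Geometry Optimized at PM3, hydrogen-depleted",
  sprintf("%3d%3d  0  0  0  0  0  0  0  0999 V2000", 5L, 4L),
  sprintf("%10.4f%10.4f%10.4f %-3s",
          c(0.0000, 1.1767, -0.8890, -0.6104, 0.3604),
          c(0.5332, 0.8510, 1.5518, -0.8524, -1.9556),
          0,
          c("C", "O", "O", "C", "N")),
  sprintf("%3d%3d%3d", c(1L, 1L, 1L, 4L), c(2L, 3L, 4L, 5L), c(2L, 1L, 1L, 1L)),
  "M  END",
  "$$$$"
)

# Entry picking: each entry ends at a line starting with "$$$$".
ends    <- which(grepl("^\\$\\$\\$\\$", sdf))
starts  <- c(1L, head(ends, -1L) + 1L)
entries <- Map(function(s, e) sdf[s:(e - 1L)], starts, ends)

rmols <- lapply(entries, parse_molfile_entry)
print(rmols[[1]]$AB)
print(rmols[[1]]$BB)
```

Data fields that may follow "M  END" in a real SD-file are simply ignored by this sketch, and only one check is made, whereas the text describes more extensive plausibility checks and error logging.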

These files are denoted with the file ending .Rbz and
referred to as "Rbz-files". Rbz-files help to bypass R-typical
memory limits for huge data collections and are useful
storage containers for any R object. By design, Rbz-files
contain the R objects as serialized list elements S[i], where
S[i]=list(objectname[i],objectcontent[i]). Because the elements
are stored one after another, a pipeline can read and process
them one at a time. They are a useful
data source for any R-driven process pipeline.
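The idea of a stream of serialized list(name, content) elements can be sketched with base R connections as shown below. This is only an illustration of the principle using bzfile(), serialize() and unserialize(); whether files written this way are compatible with the Rbz-files produced by the RMol script is not claimed, and the object names and contents are made up.

```r
# Illustrative stand-in for an Rbz-file: a bz-compressed stream of
# serialized list(name, content) elements.
path <- tempfile(fileext = ".Rbz")
objects <- list(mol_a = c("C", "O", "O", "C", "N"),
                mol_b = c("C", "C", "O"))

# Writing: one serialize() call per element S[i].
con <- bzfile(path, open = "wb")
for (name in names(objects)) {
  serialize(list(name, objects[[name]]), con)
}
close(con)

# Reading: pull one element at a time, so only one object is held in memory.
con <- bzfile(path, open = "rb")
n_atoms <- integer(0)
repeat {
  # unserialize() fails at the end of the stream; the sketch treats any
  # error as "no more objects".
  s <- tryCatch(unserialize(con), error = function(e) NULL)
  if (is.null(s)) break
  n_atoms[s[[1]]] <- length(s[[2]])   # s[[1]] is the name, s[[2]] the content
}
close(con)

print(n_atoms)
unlink(path)
```

In contrast to load() on an RData file, which restores all stored objects at once, this loop keeps memory use bounded by the size of the largest single object.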

The functions RData2Rbz and Rbz2RData convert
between the standard RData format and the serial Rbz format.
For users who are not familiar
with connection manipulation in R, we also include
the functions RbzOpen, NextRbzObject, and RbzClose
to simplify the handling of Rbz-files.



To extract and summarize properties of Rbz-packed
RMol data sets, the functions RbzSummary and
RbzSummaryReport are useful. To manipulate and
split these data sets, RbzFilter is available.
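The kind of summary and filter step described above can be illustrated on an in-memory list of RMol-like objects. The sketch below does not call RbzSummary or RbzFilter, whose arguments are not described here; make_mol, the molecule names and the symbol column are hypothetical, and only the atom block is filled in.

```r
# Hypothetical collection of RMol-like objects (atom block only).
make_mol <- function(symbols) {
  list(AB = data.frame(symbol = symbols, stringsAsFactors = FALSE))
}
mols <- list(glycine = make_mol(c("C", "O", "O", "C", "N")),
             ethanol = make_mol(c("C", "C", "O")),
             propane = make_mol(c("C", "C", "C")))

# Summary: one row per structure with atom and hetero atom counts.
summary_tab <- data.frame(
  name     = names(mols),
  n_atoms  = vapply(mols, function(m) nrow(m$AB), integer(1), USE.NAMES = FALSE),
  n_hetero = vapply(mols, function(m) sum(m$AB$symbol != "C"), integer(1),
                    USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(summary_tab)

# Filter: keep only structures that contain nitrogen.
has_n  <- vapply(mols, function(m) "N" %in% m$AB$symbol, logical(1))
with_n <- mols[has_n]
print(names(with_n))
```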

The raw graphNEL class, as defined in the R package
graph, is sufficient to build representations for graphs
without vertex and edge labels. However, to perform
the analysis of labeled graphs (e.g., graphs representing
chemical structures with hetero atoms and bond types)
by using the QuACN package, graphNEL needs to be
extended with bond and atom attributes. RMol contains
the function RMol2QuACNgN to pack the relevant information
into these attribute-extended graphNEL objects.
All RMol functions are put together in one R script. After
sourcing this script, all functions are available to support
the preparation of chemical structure data for analyzing
molecular networks.
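An attribute-extended graphNEL object of the kind described above could be assembled by hand roughly as follows. This sketch is not the implementation of RMol2QuACNgN. It assumes that the Bioconductor package graph is installed (the graph part is skipped otherwise), and the attribute names "atom" and "bond" are an assumption about what a downstream package such as QuACN expects. The bond table is the assumed glycine connectivity.

```r
# Assumed hydrogen-depleted glycine: element symbols and bond table.
symbols <- c("C", "O", "O", "C", "N")
bonds <- data.frame(from = c(1L, 1L, 1L, 4L),
                    to   = c(2L, 3L, 4L, 5L),
                    type = c(2L, 1L, 1L, 1L))
ids <- as.character(seq_along(symbols))   # node names "1" ... "5"

# The graph part runs only if the Bioconductor package 'graph' is available.
if (requireNamespace("graph", quietly = TRUE)) {
  suppressPackageStartupMessages(library(graph))

  # Undirected graph with one node per atom and one edge per bond.
  g <- graphNEL(nodes = ids, edgemode = "undirected")
  g <- addEdge(ids[bonds$from], ids[bonds$to], g, rep(1, nrow(bonds)))

  # Declare the attributes with default values, then set them one by one.
  nodeDataDefaults(g, "atom") <- "C"
  edgeDataDefaults(g, "bond") <- 1
  for (i in seq_along(ids)) {
    nodeData(g, ids[i], "atom") <- symbols[i]
  }
  for (k in seq_len(nrow(bonds))) {
    edgeData(g, ids[bonds$from[k]], ids[bonds$to[k]], "bond") <- bonds$type[k]
  }

  print(unlist(nodeData(g, attr = "atom")))
}
```

Setting the attributes element by element is slower than a vectorized assignment, but it keeps the correspondence between atoms, bonds and attribute values explicit.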

Conclusions 

In this report, we presented an R toolset to convert the
structure information of molecular graphs encoded by
SD/Molfiles into R objects. It complements existing packages
capable of reading SD-file information by easing
batch processing and using pure R scripts without dependencies.
In combination with R packages designed for
analyzing graph and network properties, it represents a
connector module for R workflows which process structure
information from SD-files. This toolset can also support
other R packages for analyzing networks structurally
and, thus, makes a further contribution towards demonstrating
the power of R for network analysis in Chemo-
and Bioinformatics.

So far, it has not been common to investigate SD-file data collections
by using R and packages thereof. The new toolset
RMol may encourage the community to broaden the
use of R in chemically and biologically driven areas.